{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas import Series, DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 1\n",
    "\n",
    "Create a data frame from four other columns (`VendorID`, `trip_distance`, `tip_amount`, and `total_amount`), specifying the `dtype` for each. What types are most appropriate? Can you use them directly, or must you first clean the data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/rr/0mnyyv811fs5vyp22gf4fxk00000gn/T/ipykernel_72055/3862543684.py:9: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
      "  df.loc['VendorID'] = df['VendorID'].astype(np.int8)\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_csv('../data/nyc_taxi_2020-01.csv',\n",
    "                usecols=['VendorID', 'trip_distance', 'tip_amount', 'total_amount'],\n",
    "                dtype={'VendorID':np.float32,\n",
    "                       'trip_distance':np.float32, \n",
    "                       'tip_amount':np.float32,\n",
    "                       'total_amount':np.float32})\n",
    "\n",
    "df = df.dropna().copy()\n",
    "df.loc['VendorID'] = df['VendorID'].astype(np.int8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 2\n",
    "\n",
    "Instead of removing `NaN` values from the `VendorID` column, set it to a new value, 3. How does that affect your specifications and cleaning of the data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('../data/nyc_taxi_2020-01.csv',\n",
    "            usecols=['VendorID', 'trip_distance', 'tip_amount', 'total_amount'],\n",
    "            dtype={'VendorID':np.float32, \n",
    "                   'trip_distance':np.float32,\n",
    "                   'tip_amount':np.float32,\n",
    "                  'total_amount':np.float32})\n",
    "\n",
    "df['VendorID'] = df['VendorID'].fillna(3)\n",
    "df['VendorID'] = df['VendorID'].astype(np.int8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 3\n",
    "\n",
    "We'll talk more about this in future chapters, but the `memory_usage` method allows you to see how much memory is being used by each column in a data frame. It returns a series of integers, in which the index lists the columns and the values represent the memory used by each column. Compare the memory used by the data frame with `float16` (which you've already used) and when you use `float64` instead for the final three columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "83265236"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Memory usage with float16\n",
    "df.memory_usage().sum()  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('../data/nyc_taxi_2020-01.csv',\n",
    "            usecols=['VendorID', 'trip_distance', 'tip_amount', 'total_amount'],\n",
    "            dtype={'VendorID':np.float32, \n",
    "                   'trip_distance':np.float32,\n",
    "                   'tip_amount':np.float32,\n",
    "                  'total_amount':np.float32})\n",
    "\n",
    "df['VendorID'] = df['VendorID'].fillna(3)\n",
    "df['VendorID'] = df['VendorID'].astype(np.int8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Memory usage with float64\n",
    "df.memory_usage().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# float64 uses about 3.5x the memory as float16!\n",
    "160125328 / 44835184"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
